Take-Home Exercise 3

In this article, we will build hedonic pricing models to examine the effect of different factors such as proxmity to certain facilities and number of facilities within a certain radius. The hedonic pricing models will be built using Geographical Weighted Regression.

Darryl Kwok https://example.com/darrylkwok
10-22-2021

Introduction

Overview

Installing relevant packages

Show code
packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse', 'httr', 'jsonlite', 'matrixStats', 'raster', 'geosphere', 'units')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Data

The following datasets will be used for our analysis:

Geospatial Data Import and Preparation

MP14_SUBZONE_WEB_PL is the URA Master Plan planning subzone boundaries.

Show code
mpsz <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL") %>%
  st_transform(3414)
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21

child-care-services-geojson is the geospatial information of the Childcare Services in Singapore.

Show code
childcare_sf <- st_read("data/geospatial/child-care-services-geojson.geojson") %>%
  st_transform(crs = 3414)
Reading layer `child-care-services-geojson' from data source `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data\geospatial\child-care-services-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1545 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6824 ymin: 1.248403 xmax: 103.9897 ymax: 1.462134
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

ELDERCARE is the geospatial information of the Elderly Care Services in Singapore.

Show code
eldercare_sf <- st_read(dsn="data/geospatial", layer = "ELDERCARE") %>% 
  st_transform(3414)
Reading layer `ELDERCARE' from data source 
  `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21

MRTLRTStnPtt is the geospatial information of the MRT and LRT Stations in Singapore.

Show code
mrtlrt_sf <- st_read(dsn="data/geospatial", layer = "MRTLRTStnPtt") %>% 
  st_transform(3414)
Reading layer `MRTLRTStnPtt' from data source 
  `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 171 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6138.311 ymin: 27555.06 xmax: 45254.86 ymax: 47854.2
Projected CRS: SVY21

parks-geojson is the geospatial information of the Parks in Singapore.

Show code
parks_sf <- st_read("data/geospatial/parks-geojson.geojson")  %>% 
  st_transform(3414)
Reading layer `parks-geojson' from data source 
  `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data\geospatial\parks-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 350 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6929 ymin: 1.214058 xmax: 104.0017 ymax: 1.461503
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

kindergartens-geojson is the geospatial information of the Kindergartens in Singapore.

Show code
kinder_sf <- st_read("data/geospatial/kindergartens-geojson.geojson") %>% 
  st_transform(3414)
Reading layer `kindergartens-geojson' from data source 
  `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data\geospatial\kindergartens-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 448 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6887 ymin: 1.247759 xmax: 103.9717 ymax: 1.455452
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

hawker-centres-geojson is the geospatial information of the Hawker Centres in Singapore.

Show code
hawkers_sf <- st_read("data/geospatial/hawker-centres-geojson.geojson") %>% 
  st_transform(3414)
Reading layer `hawker-centres-geojson' from data source 
  `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data\geospatial\hawker-centres-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449217
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

supermarkets-geojson is the geospatial information of the Supermarkets in Singapore.

Show code
supermkts_sf <- st_read("data/geospatial/supermarkets-geojson.geojson") %>% 
  st_transform(3414)
Reading layer `supermarkets-geojson' from data source 
  `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data\geospatial\supermarkets-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

busStop is the geospatial information of the BusStops in Singapore.

Show code
busstops_sf <- st_read(dsn="data/geospatial", layer = "BusStop")%>% 
  st_transform(3414)
Reading layer `BusStop' from data source 
  `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5156 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 4427.938 ymin: 26482.1 xmax: 48282.5 ymax: 52983.82
Projected CRS: SVY21

Create a function to preprocess the geospatial datasets

The actions taken will be:

Show code
geo_preprocess <- function(df, add_col) {
  df <- df[!duplicated(df[, add_col]), ] %>% 
    st_make_valid()
    
  return(df)
}

Execute the Preprocessing function on the geospatial datasets

The Geospatial Pre-processing function will be executed on all the Geospatial Data imported.

Show code
mpsz <- geo_preprocess(mpsz, "geometry")
childcare_sf <- geo_preprocess(childcare_sf, "geometry")
eldercare_sf <- geo_preprocess(eldercare_sf, "ADDRESSPOS")
mrtlrt_sf <- geo_preprocess(mrtlrt_sf, "STN_NAME")
parks_sf <- geo_preprocess(parks_sf, "geometry")
kinder_sf <- geo_preprocess(kinder_sf, "geometry")
hawkers_sf <- geo_preprocess(hawkers_sf, "geometry")
supermkts_sf <- geo_preprocess(supermkts_sf, "geometry")
busstops_sf <- geo_preprocess(busstops_sf, "geometry")

Aspatial Data Import and Preparation

We will make use of the read_csv function to read the resale flat prices that is obtained from data.gov.sg

Show code
resale_prices <- read_csv("data/aspatial/resale-flat-prices-jan-2019-to-sept-2020.csv")

Since there are occurrences of “ST.” in the CSV file, we will replace the “ST.” occurrences to “SAINT”

Show code
resale_prices$street_name <- gsub("ST\\.", "SAINT", resale_prices$street_name)

We can see that the data read in from the CSV file, does not contain any coordinates. Therefore we will have to perform geocoding.

Create a Geocoding Function

This function calls the search function of the commonapi of OneMap.

Since the search function of the API does not require a token, we will not use a token here.

The steps taken are:

Show code
geocode_addr <- function(block, street_name) {
  url <- "https://developers.onemap.sg/commonapi/search"
  
  search_addr <- paste(block, street_name, sep = " ")
  
  query <- list("searchVal" = search_addr, "returnGeom" = "Y", "getAddrDetails" = "N", "pageNum" = "1")
  
  res <- GET(url, query = query)
  
  jsonRespText<-content(res,as="text") 
  output <- fromJSON(jsonRespText)  %>% 
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)
  
  return(output)
}

Execute the Geocoding function defined above to all the rows within the dataset imported from resale-flat-prices-jan-2019-to-sept-2020.csv

Show code
resale_prices$LATITUDE <- 0
resale_prices$LONGITUDE <- 0

for  (i in 1:nrow(resale_prices)) {
  temp_output <- geocode_addr(resale_prices[i, 4], resale_prices[i, 5])
  
  resale_prices$LATITUDE[i] <- temp_output$results.LATITUDE
  resale_prices$LONGITUDE[i] <- temp_output$results.LONGITUDE
  
}

This function converts the actual remaining lease period from a string format to a double format

Steps Taken: - Split the string within the remaining_lease column. - If the string length is 4, it contains months. Else it contains years only. - Output the final value as a double.

Show code
cal_remain_lease <- function(each_col) {
  entry_list <- unlist(strsplit(each_col, '\\s+')[[1]])
  
  if (length(entry_list) > 2) {
    years <- as.numeric(unlist(entry_list[1]))
    months <-as.numeric(unlist(entry_list[3]))
    total_dur <- years + round((months/12), 2)
  } else {
    years <- as.numeric(unlist(entry_list[1]))
    total_dur <- years
  }
  
  return(unlist(total_dur))
}

Calculate the remaining lease duration for each record by executing the function defined above.

Show code
resale_prices$remaining_lease <- sapply(resale_prices$remaining_lease, cal_remain_lease)

Create a Dataframe to store the Central Business District of Singapore

The Downtown Core is one of the central districts of Singapore. Located in the Marina Bay Area, in the SouthWest part of the country.

After creating the dataframe, convert the dataframe into a sf object

Show code
lat <- 1.287953
long <- 103.851784
cbd_coor_sf <- data.frame(lat, long) %>%
  st_as_sf(., coords = c("long", "lat"), crs=4326) %>%
  st_transform(crs=3414)

Define functions to generate new columns

This function is to find the distance to the nearest facility

Steps Taken: - Make use of st_distance method to calculate the distances between all the HDBs and the facility in question. - A matrix will be generated from the function st_function - rowMins is used to get the shortest distance for each row within the matrix.

Show code
prox_prep <- function(prim_df, sec_df, var_name) {
  dist_matrix <- st_distance(prim_df, sec_df) %>% 
    drop_units()
  prim_df[,var_name] <- rowMins(dist_matrix) 
  
  return(prim_df)
}

This function is to count the number of facility within a radius

Steps Taken: - Make use of st_distance method to calculate the distances between all the HDBs and the facility in question. - Convert the matrix generated into a dataframe. - Filter out and count the number of facilities that are within a certain radius.

Show code
num_prep <- function(prim_df, sec_df, radius, var_name) {
  dist_matrix <- st_distance(prim_df, sec_df) %>% 
    drop_units()

  dist_matrix <- data.frame(dist_matrix)
  prim_df[,var_name] <- rowSums(dist_matrix <= radius)
  
  return(prim_df)
}

Create Facility Proximity columns

Show code
resale_prices_sf <- prox_prep(resale_prices_sf, eldercare_sf, "PROX_ELDER") %>%
  prox_prep(., mrtlrt_sf, "PROX_MRTLRT") %>%
  prox_prep(., parks_sf, "PROX_PARK") %>% 
  prox_prep(., hawkers_sf, "PROX_HAWKER") %>%
  prox_prep(., cbd_coor_sf, "PROX_CBD") %>% 
  prox_prep(., supermkts_sf, "PROX_SPRMKTS")

Create number of Facilities columns

Show code
resale_prices_sf <- num_prep(resale_prices_sf, childcare_sf, 350, "NUM_CC") %>%
  num_prep(., kinder_sf, 350, "NUM_KINDER") %>%
  num_prep(., busstops_sf, 350, "NUM_BUSSTOPS")

In order to save time having to rerun the pre-processing functions again, we will write the dataframe to a csv and read in when we need it.

Note: This is done after all the preprocessing and creation of new columns

After writing the sf object to a shapefile, the column names will be truncated.

Show code
st_write(resale_prices_sf, "data/final_resale_info.shp")

Read in the shapefile that contains all the information created earlier

Show code
resale_prices_sf <- st_read(dsn="data", layer="final_resale_info")
Reading layer `final_resale_info' from data source 
  `C:\darrylkwok\IS415_blog\_posts\2021-10-22-take-home-exercise-3\data' 
  using driver `ESRI Shapefile'
Simple feature collection with 15901 features and 20 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11597.31 ymin: 28217.39 xmax: 42623.63 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM

Visualise the different columns of the resale_prices_sf

Show code
glimpse(resale_prices_sf)
Rows: 15,901
Columns: 21
$ month    <chr> "2019-01", "2019-01", "2019-01", "2019-01~
$ town     <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO",~
$ flt_typ  <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "~
$ block    <chr> "204", "175", "543", "118", "411", "546",~
$ strt_nm  <chr> "ANG MO KIO AVE 3", "ANG MO KIO AVE 4", "~
$ stry_rn  <chr> "01 TO 03", "07 TO 09", "01 TO 03", "04 T~
$ flr_r_s  <dbl> 92, 91, 92, 99, 92, 92, 92, 92, 93, 91, 9~
$ flt_mdl  <chr> "New Generation", "New Generation", "New ~
$ ls_cmm_  <dbl> 1977, 1981, 1981, 1978, 1979, 1981, 1978,~
$ rmnng_l  <dbl> 57.00, 61.50, 61.08, 58.33, 59.58, 61.00,~
$ rsl_prc  <dbl> 330000, 360000, 370000, 375000, 380000, 3~
$ PROX_CB  <dbl> 8824.749, 9841.309, 9560.780, 9609.731, 8~
$ PROX_EL  <dbl> 251.4065, 631.8448, 1082.4168, 347.3195, ~
$ PROX_MR  <dbl> 703.9715, 403.4297, 889.9529, 200.9786, 8~
$ PROX_PA  <dbl> 745.0859, 429.4870, 780.0777, 177.6163, 9~
$ PROX_HA  <dbl> 441.82653, 269.72560, 258.29513, 436.4751~
$ PROX_SP  <dbl> 270.8222, 310.1889, 318.7560, 458.6748, 3~
$ NUM_CC   <dbl> 6, 5, 2, 3, 3, 2, 3, 4, 3, 2, 4, 4, 4, 5,~
$ NUM_KIN  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,~
$ NUM_BUS  <dbl> 8, 8, 8, 7, 6, 9, 6, 6, 5, 4, 10, 5, 6, 9~
$ geometry <POINT [m]> POINT (29179.92 38822.08), POINT (2~

Here the truncated column names will be rename for a better understanding of the columns.

Show code
resale_prices_sf <- resale_prices_sf %>%
  rename(floor_area_sqm = flr_r_s,
         remaining_lease = rmnng_l,
         resale_price = rsl_prc,
         PROX_CBD = PROX_CB,
         PROX_ELDER = PROX_EL,
         PROX_MRTLRT = PROX_MR,
         PROX_PARK = PROX_PA,
         PROX_HAWKER = PROX_HA,
         PROX_SPRMRT = PROX_SP)

Exploratory Data Analysis

EDA is done before any modeling or steps to get a better understanding of all the variables.

Box Plots

Box Plot for the Resale Prices

Summary Statistics for Resale Prices

Show code
summary(resale_prices_sf$resale_price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 218000  353000  405000  433589  470000 1186888 

Box plot for Resale Prices

Show code
ggplot(resale_prices_sf, aes(x = '', y = resale_price)) +
  geom_boxplot() + 
  labs(x='', y='Resale Price') +
  theme_minimal()

Histograms

Create a function to plot the histogram for the variables

Distribution of the Resale Prices

Show code
ggplot(resale_prices_sf, aes(x = resale_price)) + 
  geom_histogram(fill = 'darksalmon') +
  labs(title = "Distribution of Resale Prices",
       x = "Resale Prices",
       y = 'Frequency') +
  theme_minimal()

Distribution of the independent variables

The independent variables consists of:

Show code
AREA_SQM <- ggplot(resale_prices_sf, aes(x = floor_area_sqm)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

REMAIN_LEASE <- ggplot(resale_prices_sf, aes(x = remaining_lease)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

PROX_CBD <- ggplot(resale_prices_sf, aes(x = PROX_CBD)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

PROX_ELDER <- ggplot(resale_prices_sf, aes(x = PROX_ELDER)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

PROX_MRTLRT <- ggplot(resale_prices_sf, aes(x = PROX_MRTLRT)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

PROX_PARK <- ggplot(resale_prices_sf, aes(x = PROX_PARK)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

PROX_HAWKER <- ggplot(resale_prices_sf, aes(x = PROX_HAWKER)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

PROX_SPRMRT <- ggplot(resale_prices_sf, aes(x = PROX_SPRMRT)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

NUM_CC <- ggplot(resale_prices_sf, aes(x = NUM_CC)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

NUM_KIN <- ggplot(resale_prices_sf, aes(x = NUM_KIN)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

NUM_BUS <- ggplot(resale_prices_sf, aes(x = NUM_BUS)) + 
  geom_histogram(bins=20, fill = 'lightblue') +
  theme_minimal()

ggarrange(AREA_SQM, REMAIN_LEASE, PROX_CBD, PROX_ELDER, PROX_MRTLRT, PROX_PARK, PROX_HAWKER, PROX_SPRMRT, NUM_CC, NUM_KIN, NUM_BUS, ncol = 3, nrow = 4)

Top 10 Towns with the highest Resale Prices

Create a new dataframe called towns_avg. This dataframe consists of the average Resale Price of each town.

After calculating the average, rescale the resale price by dividing it by 100000. This is to ensure that the values are more easily understandable when plotted.

Show code
towns_avg <- aggregate(resale_prices_sf[,"resale_price"], list(resale_prices_sf$town), mean) %>%
  rename(town = Group.1)

towns_avg$resale_price <- sapply(towns_avg$resale_price, function(x) x/100000)
Show code
top10_price = top_n(towns_avg, 10, resale_price)

ggplot(top10_price, aes(x=resale_price, y=reorder(town, resale_price), label=round(resale_price, 2))) +
  geom_col(fill='darksalmon') +
  labs(title='Top 10 Towns with the highest Average Resale Prices',
       x='Resale Price ($100000)',
       y='Town') +
  geom_text(nudge_x=0.01, colour='gray23', size=3.5) +
  theme_minimal()

Statistical Point Map

This is to reveal the Geospatial distribution of the Resale Prices in Singapore.

Show code
tmap_mode("view")

tm_shape(mpsz)+
  tm_polygons() +
tm_shape(resale_prices_sf) +  
  tm_dots(col = "resale_price",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
Show code
tmap_mode("plot")

Regression

Multiple Linear Regression Model

Visualising the relationships of the independent variables

By visualising the relationships of the independent variables in a correlation heatmap allows us to remove highly correlated variables that will affect the accuracy of the subsequent regression models that we will build.

Show code
var_list <- c("floor_area_sqm", "remaining_lease", "PROX_CBD", "PROX_ELDER", "PROX_MRTLRT", "PROX_PARK", "PROX_HAWKER", "PROX_SPRMRT", "NUM_CC", "NUM_KIN", "NUM_BUS")

resale_prices <- resale_prices_sf %>%
  st_drop_geometry()

corrplot(cor(resale_prices[, var_list]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

Building the Multi-Linear Regression Model

Show code
resale_mlr <- lm(formula = resale_price ~ floor_area_sqm + remaining_lease + PROX_CBD + PROX_ELDER + PROX_HAWKER + PROX_MRTLRT + PROX_PARK + PROX_SPRMRT + NUM_BUS + NUM_CC + NUM_KIN,  data = resale_prices_sf) 

summary(resale_mlr)

Call:
lm(formula = resale_price ~ floor_area_sqm + remaining_lease + 
    PROX_CBD + PROX_ELDER + PROX_HAWKER + PROX_MRTLRT + PROX_PARK + 
    PROX_SPRMRT + NUM_BUS + NUM_CC + NUM_KIN, data = resale_prices_sf)

Residuals:
    Min      1Q  Median      3Q     Max 
-229623  -44498   -1424   36879  478253 

Coefficients:
                  Estimate Std. Error  t value Pr(>|t|)    
(Intercept)     96663.1126  8978.1407   10.766  < 2e-16 ***
floor_area_sqm   2645.2748    80.0689   33.037  < 2e-16 ***
remaining_lease  4961.6167    46.1827  107.435  < 2e-16 ***
PROX_CBD          -19.7967     0.1613 -122.749  < 2e-16 ***
PROX_ELDER        -12.2039     0.8712  -14.009  < 2e-16 ***
PROX_HAWKER       -20.8572     1.1344  -18.386  < 2e-16 ***
PROX_MRTLRT       -28.8492     1.4660  -19.680  < 2e-16 ***
PROX_PARK          -3.8584     1.3041   -2.959  0.00309 ** 
PROX_SPRMRT       -17.7018     3.6532   -4.846 1.27e-06 ***
NUM_BUS          1373.4801   195.4588    7.027 2.20e-12 ***
NUM_CC          -6200.5685   327.6457  -18.925  < 2e-16 ***
NUM_KIN          9458.3497   650.6929   14.536  < 2e-16 ***
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 67820 on 15889 degrees of freedom
Multiple R-squared:  0.6815,    Adjusted R-squared:  0.6813 
F-statistic:  3090 on 11 and 15889 DF,  p-value: < 2.2e-16
Show code
ols_regress(resale_mlr)
                            Model Summary                              
----------------------------------------------------------------------
R                       0.826       RMSE                    67816.391 
R-Squared               0.681       Coef. Var                  15.641 
Adj. R-Squared          0.681       MSE                4599062877.331 
Pred R-Squared          0.681       MAE                     51385.045 
----------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                    Sum of                                                      
                   Squares           DF       Mean Square       F          Sig. 
--------------------------------------------------------------------------------
Regression    1.563443e+14           11      1.421312e+13    3090.439    0.0000 
Residual      7.307451e+13        15889    4599062877.331                       
Total         2.294188e+14        15900                                         
--------------------------------------------------------------------------------

                                          Parameter Estimates                                            
--------------------------------------------------------------------------------------------------------
          model         Beta    Std. Error    Std. Beta       t         Sig         lower         upper 
--------------------------------------------------------------------------------------------------------
    (Intercept)    96663.113      8978.141                   10.766    0.000    79064.940    114261.285 
 floor_area_sqm     2645.275        80.069        0.156      33.037    0.000     2488.331      2802.219 
remaining_lease     4961.617        46.183        0.531     107.435    0.000     4871.093      5052.140 
       PROX_CBD      -19.797         0.161       -0.683    -122.749    0.000      -20.113       -19.481 
     PROX_ELDER      -12.204         0.871       -0.067     -14.009    0.000      -13.911       -10.496 
    PROX_HAWKER      -20.857         1.134       -0.090     -18.386    0.000      -23.081       -18.634 
    PROX_MRTLRT      -28.849         1.466       -0.093     -19.680    0.000      -31.723       -25.976 
      PROX_PARK       -3.858         1.304       -0.014      -2.959    0.003       -6.415        -1.302 
    PROX_SPRMRT      -17.702         3.653       -0.023      -4.846    0.000      -24.863       -10.541 
        NUM_BUS     1373.480       195.459        0.033       7.027    0.000      990.359      1756.602 
         NUM_CC    -6200.569       327.646       -0.095     -18.925    0.000    -6842.791     -5558.346 
        NUM_KIN     9458.350       650.693        0.069      14.536    0.000     8182.918     10733.782 
--------------------------------------------------------------------------------------------------------

Testing the assumptions of Linear Regression

In order for us to perform regression on geographical data, we must first make sure that the 4 assumptions are met.

Multi-Collinearity Test

Show code
ols_vif_tol(resale_mlr)
         Variables Tolerance      VIF
1   floor_area_sqm 0.8948905 1.117455
2  remaining_lease 0.8219658 1.216596
3         PROX_CBD 0.6471850 1.545153
4       PROX_ELDER 0.8680196 1.152048
5      PROX_HAWKER 0.8443361 1.184362
6      PROX_MRTLRT 0.9040858 1.106090
7        PROX_PARK 0.8445421 1.184074
8      PROX_SPRMRT 0.8752257 1.142562
9          NUM_BUS 0.8938181 1.118796
10          NUM_CC 0.7977060 1.253595
11         NUM_KIN 0.8857889 1.128937

Non-linearity Test

Show code
ols_plot_resid_fit(resale_mlr)

Normality Assumption Test

Show code
ols_plot_resid_hist(resale_mlr)

Spatial Autocorrelation Test

Since the hedonic model we are building are using geographically referenced attributes, it is important for us to visualise the residuals of the pricing model.

Retrieve the residuals of the pricing model and save it as a dataframe

Show code
mlr_output <- as.data.frame(resale_mlr$residuals)

Then join the newly created dataframe to the resale_prices_sf

Show code
resale_prices_res_sf <- cbind(resale_prices_sf, 
                        mlr_output) %>% 
  rename(MLR_RES = resale_mlr.residuals)

Since the spdep package can only be used to process sp conformed spatial data objects, we will convert the resale_prices_res_sg to Spatial Objects.

Show code
resale_prices_res_sp <- as_Spatial(resale_prices_res_sf)

resale_prices_res_sp
class       : SpatialPointsDataFrame 
features    : 15901 
extent      : 11597.31, 42623.63, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 21
names       :   month,       town, flt_typ, block,       strt_nm,  stry_rn, floor_area_sqm,       flt_mdl, ls_cmm_, remaining_lease, resale_price,         PROX_CBD,        PROX_ELDER,      PROX_MRTLRT,        PROX_PARK, ... 
min values  : 2019-01, ANG MO KIO,  4 ROOM,     1,  ADMIRALTY DR, 01 TO 03,             74, Adjoined flat,    1967,            45.5,       218000, 999.393538715878, 0.000334897933105, 22.0407324774434, 44.1643212802781, ... 
max values  : 2020-09,     YISHUN,  4 ROOM,    9B, YUNG SHENG RD, 49 TO 51,            138,       Type S1,    2018,              97,      1186888, 19650.0691667807,  3301.63731686804, 2130.60636038504, 2413.13695915468, ... 

After converting, we will use the tmap package to visualise the distribution of the residuals on an interactive map for a more in-depth analysis.

Show code
tmap_mode("view")

tm_shape(mpsz)+
  tm_polygons(alpha = 0.4) +
tm_shape(resale_prices_res_sp) +  
  tm_dots(col = "MLR_RES",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
Show code
tmap_mode("plot")

Compute the distance-based weight matrix.

This will be performed by using dnearneigh function of the spdep function.

Show code
nb <- dnearneigh(coordinates(resale_prices_res_sp), 0, 1500, longlat = FALSE)

We will then convert the output neighbours list into spatial weights

Show code
nb_lw <- nb2listw(nb, style = "W")

Perform Moran’s I test for residual spatial autocorrelation

Show code
lm.morantest(resale_mlr, nb_lw)

    Global Moran I for regression residuals

data:  
model: lm(formula = resale_price ~ floor_area_sqm +
remaining_lease + PROX_CBD + PROX_ELDER + PROX_HAWKER
+ PROX_MRTLRT + PROX_PARK + PROX_SPRMRT + NUM_BUS +
NUM_CC + NUM_KIN, data = resale_prices_sf)
weights: nb_lw

Moran I statistic standard deviate = 782.27, p-value
< 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
    4.031030e-01    -3.305787e-04     2.659672e-07 

Regression with Geographic Weighted Regression Models

Adaptive Bandwidth Geographic

Show code
bw_adaptive <- bw.gwr(formula = resale_price ~ floor_area_sqm + remaining_lease + PROX_CBD + PROX_ELDER + PROX_HAWKER + PROX_MRTLRT + PROX_PARK + PROX_SPRMRT + NUM_BUS + NUM_CC + NUM_KIN,  data = resale_prices_res_sp, approach="CV", kernel="gaussian", adaptive=TRUE, longlat=FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
Adaptive bandwidth: 9834 CV score: 6.683267e+13 
Adaptive bandwidth: 6086 CV score: 6.23757e+13 
Adaptive bandwidth: 3767 CV score: 5.630545e+13 
Adaptive bandwidth: 2337 CV score: 5.007026e+13 
Adaptive bandwidth: 1449 CV score: 4.029033e+13 
Adaptive bandwidth: 905 CV score: 3.355382e+13 
Adaptive bandwidth: 563 CV score: 2.829336e+13 
Adaptive bandwidth: 358 CV score: 2.426545e+13 
Adaptive bandwidth: 224 CV score: 2.109395e+13 
Adaptive bandwidth: 149 CV score: 1.905384e+13 
Adaptive bandwidth: 94 CV score: 1.722377e+13 
Adaptive bandwidth: 69 CV score: 1.626092e+13 
Adaptive bandwidth: 44 CV score: 5.190474e+13 
Adaptive bandwidth: 74 CV score: 1.642574e+13 
Adaptive bandwidth: 55 CV score: 1.571203e+13 
Adaptive bandwidth: 57 CV score: 1.577839e+13 
Adaptive bandwidth: 64 CV score: 1.610874e+13 
Adaptive bandwidth: 59 CV score: 1.584489e+13 
Adaptive bandwidth: 62 CV score: 1.60042e+13 
Adaptive bandwidth: 60 CV score: 1.595027e+13 
Adaptive bandwidth: 61 CV score: 1.59763e+13 
Adaptive bandwidth: 60 CV score: 1.595027e+13 
Adaptive bandwidth: 60 CV score: 1.595027e+13 
Adaptive bandwidth: 59 CV score: 1.584489e+13 
Adaptive bandwidth: 59 CV score: 1.584489e+13 
Adaptive bandwidth: 58 CV score: 1.582187e+13 
Adaptive bandwidth: 58 CV score: 1.582187e+13 
Adaptive bandwidth: 57 CV score: 1.577839e+13 
Adaptive bandwidth: 57 CV score: 1.577839e+13 
Adaptive bandwidth: 56 CV score: 1.574875e+13 
Adaptive bandwidth: 56 CV score: 1.574875e+13 
Adaptive bandwidth: 55 CV score: 1.571203e+13 

Construct the Geographic Weighted Regression Model

Show code
gwr_adaptive <- gwr.basic(formula = resale_price ~ floor_area_sqm + remaining_lease + PROX_CBD + PROX_ELDER + PROX_HAWKER + PROX_MRTLRT + PROX_PARK + PROX_SPRMRT + NUM_BUS + NUM_CC + NUM_KIN,  data=resale_prices_res_sp, bw=bw_adaptive, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)

gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2021-11-05 17:42:24 
   Call:
   gwr.basic(formula = resale_price ~ floor_area_sqm + remaining_lease + 
    PROX_CBD + PROX_ELDER + PROX_HAWKER + PROX_MRTLRT + PROX_PARK + 
    PROX_SPRMRT + NUM_BUS + NUM_CC + NUM_KIN, data = resale_prices_res_sp, 
    bw = bw_adaptive, kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  resale_price
   Independent variables:  floor_area_sqm remaining_lease PROX_CBD PROX_ELDER PROX_HAWKER PROX_MRTLRT PROX_PARK PROX_SPRMRT NUM_BUS NUM_CC NUM_KIN
   Number of data points: 15901
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-229623  -44498   -1424   36879  478253 

   Coefficients:
                     Estimate Std. Error  t value Pr(>|t|)
   (Intercept)     96663.1126  8978.1407   10.766  < 2e-16
   floor_area_sqm   2645.2748    80.0689   33.037  < 2e-16
   remaining_lease  4961.6167    46.1827  107.435  < 2e-16
   PROX_CBD          -19.7967     0.1613 -122.749  < 2e-16
   PROX_ELDER        -12.2039     0.8712  -14.009  < 2e-16
   PROX_HAWKER       -20.8572     1.1344  -18.386  < 2e-16
   PROX_MRTLRT       -28.8492     1.4660  -19.680  < 2e-16
   PROX_PARK          -3.8584     1.3041   -2.959  0.00309
   PROX_SPRMRT       -17.7018     3.6532   -4.846 1.27e-06
   NUM_BUS          1373.4801   195.4588    7.027 2.20e-12
   NUM_CC          -6200.5685   327.6457  -18.925  < 2e-16
   NUM_KIN          9458.3497   650.6929   14.536  < 2e-16
                      
   (Intercept)     ***
   floor_area_sqm  ***
   remaining_lease ***
   PROX_CBD        ***
   PROX_ELDER      ***
   PROX_HAWKER     ***
   PROX_MRTLRT     ***
   PROX_PARK       ** 
   PROX_SPRMRT     ***
   NUM_BUS         ***
   NUM_CC          ***
   NUM_KIN         ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 67820 on 15889 degrees of freedom
   Multiple R-squared: 0.6815
   Adjusted R-squared: 0.6813 
   F-statistic:  3090 on 11 and 15889 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 7.307451e+13
   Sigma(hat): 67795.06
   AIC:  398922.3
   AICc:  398922.3
   BIC:  383246.8
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 55 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                          Min.     1st Qu.      Median
   Intercept       -3.1005e+08 -4.7007e+05  5.7437e+03
   floor_area_sqm  -3.5732e+04  1.4123e+03  2.2649e+03
   remaining_lease -2.1363e+04  3.3263e+03  5.1325e+03
   PROX_CBD        -1.7724e+04 -6.3912e+01 -1.5970e+01
   PROX_ELDER      -1.4109e+04 -5.0652e+01 -1.4241e+00
   PROX_HAWKER     -2.8317e+05 -5.9873e+01 -3.4871e+00
   PROX_MRTLRT     -1.2518e+04 -1.2285e+02 -6.0110e+01
   PROX_PARK       -1.1162e+04 -6.5740e+01 -6.8723e+00
   PROX_SPRMRT     -4.7879e+03 -5.2242e+01 -7.4794e+00
   NUM_BUS         -1.7419e+05 -2.0836e+03 -8.1557e+01
   NUM_CC          -1.0660e+05 -2.3138e+03  3.9164e+02
   NUM_KIN         -8.9843e+06 -7.4884e+03 -3.8448e+02
                       3rd Qu.       Max.
   Intercept        5.6784e+05 2.6823e+08
   floor_area_sqm   3.5340e+03 3.9182e+05
   remaining_lease  7.2972e+03 3.0858e+04
   PROX_CBD         2.3686e+01 1.7229e+04
   PROX_ELDER       4.9068e+01 1.0912e+05
   PROX_HAWKER      5.3232e+01 1.3342e+04
   PROX_MRTLRT     -3.4825e+00 2.5484e+05
   PROX_PARK        4.8334e+01 9.6436e+03
   PROX_SPRMRT      4.9867e+01 9.9058e+03
   NUM_BUS          1.8812e+03 5.3515e+04
   NUM_CC           3.4879e+03 8.0181e+04
   NUM_KIN          5.1641e+03 5.0395e+06
   ************************Diagnostic information*************************
   Number of data points: 15901 
   Effective number of parameters (2trace(S) - trace(S'S)): 1510.406 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 14390.59 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 374228.1 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 372806 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 367466.8 
   Residual sum of squares: 1.311961e+13 
   R-square value:  0.9428137 
   Adjusted R-square value:  0.9368112 

   ***********************************************************************
   Program stops at: 2021-11-05 17:45:39 

Visualising GWR Output

In order to better understand the Geographic Weighted Regression Model’s output, we will visualise it.

Converting SDF into sf dataframe

Show code
resale_prices_sf_adaptive <- st_as_sf(gwr_adaptive$SDF) %>%
  st_transform(crs=3414)
Show code
tmap_mode("view")
tm_shape(mpsz)+
  tm_polygons(alpha = 0.1) +
tm_shape(resale_prices_sf_adaptive) +  
  tm_dots(col = "Local_R2",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(11,14))